I’ve chosen data about Brooklyn real estate sales in 2015. It consists of more than 23 000 observations of 21 variables.
## 'data.frame': 23223 obs. of 21 variables:
## $ borough : num 3 3 3 3 3 3 3 3 3 3 ...
## $ neighborhood : Factor w/ 60 levels "BATH BEACH","BAY RIDGE",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ building.class.category : Factor w/ 44 levels "01 ONE FAMILY DWELLINGS",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ tax.class.at.present : Factor w/ 10 levels "1","1A","1B",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ block : Factor w/ 5525 levels "20","27","28",..: 3910 3911 3919 3920 3922 3922 3937 3938 3938 3940 ...
## $ lot : Factor w/ 1083 levels "1","2","3","4",..: 22 17 60 48 49 51 39 8 108 23 ...
## $ ease-ment : chr " " " " " " " " ...
## $ building.class.at.present : Factor w/ 128 levels "A1","A2","A3",..: 5 5 7 106 106 106 1 106 106 5 ...
## $ address : chr "8647 15TH AVENUE" "55 BAY 10TH STREET" "8620 19TH AVENUE" "1906 86TH STREET" ...
## $ apartment.number : chr NA NA NA NA ...
## $ zip.code : Factor w/ 39 levels "11201","11203",..: 27 27 13 13 13 13 13 13 13 13 ...
## $ residential.units : num 1 1 1 1 1 1 1 1 1 1 ...
## $ commercial.units : num 0 0 0 1 1 1 0 1 1 0 ...
## $ total.units : num 1 1 1 2 2 2 1 2 2 1 ...
## $ land.square.feet : num 1547 1933 2417 1900 1725 ...
## $ gross.square.feet : num 1428 1660 2106 2090 2112 ...
## $ year.built : num 1930 1930 1930 1931 1925 ...
## $ tax.class.at.time.of.sale : Factor w/ 4 levels "1","2","3","4": 1 1 1 1 1 1 1 1 1 1 ...
## $ building.class.at.time.of.sale: Factor w/ 132 levels "A0","A1","A2",..: 6 6 8 110 110 110 2 110 110 6 ...
## $ sale.price : num 758000 778000 0 1365000 1470000 ...
## $ sale.date : Date, format: "2015-03-31" "2015-06-15" ...
## borough neighborhood building.class.category tax.class.at.present block
## 1 3 BATH BEACH 01 ONE FAMILY DWELLINGS 1 6360
## 2 3 BATH BEACH 01 ONE FAMILY DWELLINGS 1 6361
## 3 3 BATH BEACH 01 ONE FAMILY DWELLINGS 1 6371
## 4 3 BATH BEACH 01 ONE FAMILY DWELLINGS 1 6372
## 5 3 BATH BEACH 01 ONE FAMILY DWELLINGS 1 6374
## 6 3 BATH BEACH 01 ONE FAMILY DWELLINGS 1 6374
## lot ease-ment building.class.at.present address
## 1 22 A5 8647 15TH AVENUE
## 2 17 A5 55 BAY 10TH STREET
## 3 60 A9 8620 19TH AVENUE
## 4 48 S1 1906 86TH STREET
## 5 49 S1 1964 86TH STREET
## 6 51 S1 1970 86TH STREET
## apartment.number zip.code residential.units commercial.units total.units
## 1 <NA> 11228 1 0 1
## 2 <NA> 11228 1 0 1
## 3 <NA> 11214 1 0 1
## 4 <NA> 11214 1 1 2
## 5 <NA> 11214 1 1 2
## 6 <NA> 11214 1 1 2
## land.square.feet gross.square.feet year.built tax.class.at.time.of.sale
## 1 1547 1428 1930 1
## 2 1933 1660 1930 1
## 3 2417 2106 1930 1
## 4 1900 2090 1931 1
## 5 1725 2112 1925 1
## 6 1725 2112 1931 1
## building.class.at.time.of.sale sale.price sale.date
## 1 A5 758000 2015-03-31
## 2 A5 778000 2015-06-15
## 3 A9 0 2015-09-16
## 4 S1 1365000 2015-05-29
## 5 S1 1470000 2015-05-06
## 6 S1 1790000 2015-04-30
First I decided to look at the price distribution:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0 0 350000 802800 825000 169000000
It looks like at least a quarter of the rows have a zero price, so now I want to look at the lowest values:
##
## 0 1 5 7 9 10
## 8157 56 1 1 1 227
Then I exclude zeros and repeat the summary:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1 378000 660000 1237000 1100000 169000000
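The two steps above (counting the lowest prices, then summarising only rows with a price) can be sketched roughly as follows. This is a minimal illustration, not the original code: the `sales` data frame here is a toy stand-in with invented values, and `low.counts` and `priced` are hypothetical names.

```r
# Toy stand-in for the real data; all values are invented for illustration.
sales <- data.frame(sale.price = c(0, 0, 1, 10, 350000, 758000, 1365000))

# Count how often each of the lowest prices occurs.
low.counts <- table(sales$sale.price[sales$sale.price <= 10])
print(low.counts)

# Keep only rows that actually have a price and repeat the summary.
priced <- subset(sales, sale.price > 0)
print(summary(priced$sale.price))
```

Keeping the filtered rows in a separate data frame leaves the original untouched, so the zero-price rows stay available for the other variables.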
Now I’m interested in those $1 properties
## building.class.category address year.built
## 1 01 ONE FAMILY DWELLINGS 415 99TH STREET 1899
## 2 07 RENTALS - WALKUP APARTMENTS 263 BAY RIDGE AVENUE 1931
## 3 02 TWO FAMILY DWELLINGS 523 LEXINGTON AVENUE 1993
## 4 03 THREE FAMILY DWELLINGS 109A HART STREET 2005
## 5 07 RENTALS - WALKUP APARTMENTS 1077-79 BEDFORD AVENUE 1931
## 6 07 RENTALS - WALKUP APARTMENTS 165 QUINCY STREET 1931
## 7 08 RENTALS - ELEVATOR APARTMENTS 273 GATES AVENUE 1920
## 8 14 RENTALS - 4-10 UNIT 260 MARCUS GARVEY BOULEV 1931
## 9 01 ONE FAMILY DWELLINGS 1829 51ST STREET 1920
## 10 12 CONDOS - WALKUP APARTMENTS 3822A 15TH AVENUE NA
## gross.square.feet sale.price
## 1 652 1
## 2 5880 1
## 3 1802 1
## 4 3093 1
## 5 11520 1
## 6 5580 1
## 7 44460 1
## 8 4056 1
## 9 1344 1
## 10 0 1
The other values look normal, so I assume the $1 price is fictitious.
Let’s look at a histogram with prices divided by 1000. The distribution has a very long tail, so let’s look at all prices lower than 5 mln.
Most prices are distributed between 1000 and 1 000 000 with peaks around 500 000, 950 000 and 1 250 000. There are also many more prices slightly lower than 1 mln than slightly higher. After log transformation prices look nearly normal, with some exceptions that are lower than 1000.
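To show why the log transformation helps with such a long tail, here is a small sketch that bins log10 prices into bins one order of magnitude wide. The prices are invented, and the base-10 logarithm and the bin edges are my assumptions; the report does not say which were used.

```r
# Invented prices spanning several orders of magnitude (not the real data).
prices <- c(1, 10, 350000, 660000, 950000, 1250000, 169000000)

# On the log10 scale each unit step is one order of magnitude.
log.price <- log10(prices)

# Bin into whole orders of magnitude without drawing the plot.
h <- hist(log.price, breaks = 0:9, plot = FALSE)
print(h$counts)
```

The suspiciously low prices end up in the lowest bin, well separated from the main body of the distribution, which is how they show up as exceptions on the log-scale histogram.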
The next feature of interest for me is gross square feet.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0 0 1728 2889 2870 366000
It looks like there are also lots of zeros here. Summary without zeros:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 65 1800 2480 4501 3435 366000
Let’s look at the distribution of values greater than 0.
There is a very long tail here too, so I focus on values between 1 and 10000. Most values are between 1000 and 3500 square feet and the peak is around 2000. Now I want to look at it after log transformation.
Now it looks closer to normal, but still with long tails.
Summary of land.square.feet:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0 0 1900 2223 2500 293000
Looks like here we also have lots of zeros and outliers.
Summary without zeros:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 20 1882 2083 3284 2750 293000
Histogram:
Most values are distributed between 1500 and 3000 with spikes at round numbers (2000, 2500, 3000 …). The most common value is 2000.
After log transformation:
I decided to make a new variable “total.square.feet” as the sum of “land.square.feet” and “gross.square.feet”.
Summary of the new variable:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0 0 3818 5112 5265 589300
I still have 7500 zero values. Summary of non-zero values:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 20 3780 4700 7551 6116 589300
Obviously I have a similar distribution here, with a very long tail.
Then I want to make another variable, price per square foot.
Summary of non-zero values:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.01 110.40 166.80 219.60 244.40 12060.00
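A minimal sketch of how the two derived variables could be built. The data frame is a toy stand-in with invented rows. Dividing by total (rather than gross) square feet is my assumption, since the report does not say which area is the denominator. The column name `price.square.foot` follows the plotting code later in the report.

```r
# Toy stand-in for the real data; rows are invented for illustration.
sales <- data.frame(
  land.square.feet  = c(1547, 1933, 0, 1900),
  gross.square.feet = c(1428, 1660, 0, 2090),
  sale.price        = c(758000, 778000, 500000, 0)
)

# Total area is the sum of land and gross floor area.
sales$total.square.feet <- sales$land.square.feet + sales$gross.square.feet

# Price per square foot; NA where the area is unknown (recorded as zero),
# which avoids dividing by zero. The denominator is an assumption.
sales$price.square.foot <- ifelse(
  sales$total.square.feet > 0,
  sales$sale.price / sales$total.square.feet,
  NA
)

# Summarise only the non-zero, non-missing values.
ok <- !is.na(sales$price.square.foot) & sales$price.square.foot > 0
print(summary(sales$price.square.foot[ok]))
```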
Distribution:
The next feature of interest is year.built:
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 1800 1915 1930 1940 1960 2016 1993
Its distribution:
Most buildings were built between 1895 and 1935, with other peaks in 1950-1956 and 2005-2015. With binwidth = 1 it is possible to notice spikes at round years (1900, 1920…). I assume some of these values are approximate.
Counts of residential.units values:
##
## 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14
## 5608 6767 6104 2292 827 162 549 90 234 54 31 7 49 11 12
## 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29
## 12 47 14 14 12 37 13 9 10 18 5 6 6 8 6
## 30 31 32 33 34 35 36 38 39 40 41 42 43 44 45
## 3 13 13 4 4 8 5 10 10 5 5 5 2 3 2
## 47 48 49 50 51 52 53 54 55 56 58 59 60 61 62
## 5 8 2 4 1 4 2 4 1 2 5 1 2 1 3
## 63 64 65 66 67 68 69 70 72 74 75 77 78 79 81
## 2 1 1 3 1 3 1 2 2 3 1 3 2 1 1
## 82 83 84 89 90 92 93 95 96 102 103 104 107 108 112
## 2 1 5 2 1 1 1 1 1 3 1 1 1 1 1
## 114 118 119 120 121 126 131 133 169 172 178 190 200 225 234
## 1 1 2 1 1 1 1 2 1 1 1 2 1 1 1
## 268 270 334 338
## 1 1 1 1
Histogram:
Most properties have from 0 to 5 residential units.
##
## 0 1 2 3 4 5 6 7 8 9 10 11
## 20832 1827 321 100 61 34 16 7 4 1 3 1
## 12 13 15 16 24 28 29 30 54 201 355
## 2 1 1 2 4 1 1 1 1 1 1
Histogram:
Most properties have 0 commercial units.
##
## 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14
## 3665 8211 5882 2743 845 262 590 105 266 59 53 17 46 19 6
## 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29
## 11 56 14 15 11 36 18 6 14 20 8 7 1 13 10
## 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44
## 5 11 11 2 3 8 4 3 8 9 4 8 7 1 5
## 45 47 48 49 50 51 52 53 54 56 58 59 60 61 62
## 3 5 9 2 4 1 5 1 4 2 5 3 2 1 3
## 63 64 65 66 67 68 69 70 72 75 77 78 79 81 82
## 2 1 1 3 1 3 1 3 2 3 3 2 2 1 1
## 83 84 85 86 89 90 94 95 96 97 102 103 104 107 108
## 1 4 1 1 1 2 1 1 1 1 3 1 1 1 1
## 112 114 118 120 121 123 126 131 135 169 172 178 192 200 201
## 1 1 1 1 1 2 1 1 2 1 1 1 2 1 1
## 225 237 270 335 339 355
## 1 1 2 1 1 1
Histogram:
Most properties have between 0 and 5 total units.
##
## 3
## 23223
All observations of this variable have the same value.
Categorical variable with 60 levels
I’ve noticed that neighborhoods are represented unevenly in the dataset. Some of them have only a small number of observations.
Categorical with 44 levels. Ten most common values:
##
## 02 TWO FAMILY DWELLINGS 01 ONE FAMILY DWELLINGS
## 6015 3107
## 10 COOPS - ELEVATOR APARTMENTS 03 THREE FAMILY DWELLINGS
## 2232 2173
## 13 CONDOS - ELEVATOR APARTMENTS 07 RENTALS - WALKUP APARTMENTS
## 1998 1925
## 15 CONDOS - 2-10 UNIT RESIDENTIAL 44 CONDO PARKING
## 722 695
## 09 COOPS - WALKUP APARTMENTS 04 TAX CLASS 1 CONDOS
## 569 526
Most properties are one-three family dwellings or condos.
Categorical with 10 levels
##
## 1 1A 1B 1C 2 2A 2B 2C 3 4
## 11295 393 314 134 5401 1609 465 1046 2 2503
I transformed this to a factor, which has 5525 levels. The most frequent values are:
##
## 8720 1890 2135 1896 152 286 7279 1217 2324 2348
## 135 109 96 95 91 90 69 54 53 53
I transformed it to a factor too and got 1083 different values. The most frequent of them are:
##
## 1 11 6 12 18 21 17 14 35 13
## 795 322 314 288 288 283 282 280 277 275
##
##
## 23223
All observations have empty values.
Factor with 128 levels. Most common values:
##
## D4 B1 C0 R4 B3 B2 A5 B9 A1 A9
## 2230 2222 2164 1994 1251 1146 976 900 776 744
All values:
##
## A1 A2 A3 A4 A5 A7 A9 B1 B2 B3 B9 C0 C1 C2 C3
## 776 188 53 173 976 2 744 2222 1146 1251 900 2164 525 577 663
## C4 C5 C6 C7 C8 C9 D0 D1 D3 D4 D5 D6 D7 D8 D9
## 34 31 563 127 6 7 2 106 13 2230 7 18 34 2 7
## E1 E2 E7 E9 F1 F2 F4 F5 F9 G0 G1 G2 G4 G5 G6
## 52 5 5 110 11 3 10 13 77 31 67 24 7 7 10
## G7 G8 G9 GU GW I1 I4 I5 I6 I7 I9 J9 K1 K2 K4
## 173 4 40 4 3 3 6 5 4 7 2 1 135 83 149
## K5 K6 K7 K9 L1 L8 L9 M1 M2 M3 M4 M9 N2 N9 O1
## 8 1 3 1 1 1 3 22 1 2 1 12 2 4 11
## O2 O5 O7 O8 O9 P3 P5 P6 P9 Q9 R0 R1 R2 R3 R4
## 21 28 23 11 4 2 1 1 3 1 1 731 341 393 1994
## R5 R6 R7 R8 R9 RA RB RG RK RP RR RS RT RW S0
## 10 133 1 47 37 3 63 397 24 300 7 124 23 41 6
## S1 S2 S3 S4 S5 S9 T9 U7 U8 V0 V1 V2 V3 V5 V9
## 179 479 104 101 74 132 1 1 1 303 199 5 6 3 9
## W1 W2 W3 W8 W9 Y1 Z0 Z9
## 1 5 2 2 10 1 5 97
Character values that should represent unique buildings or apartments. I want to see if any of them repeat:
##
## 163 WASHINGTON AVENUE 185 PACIFIC STREET 388 BRIDGE STREET
## 106 85 63
## 143 CLASSON AVENUE 184 KENT AVENUE
## 59 53
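One way to find repeated addresses is to tabulate them and sort by frequency. This is a sketch on a small invented vector, with hypothetical names `address.counts` and `repeated`; it is not necessarily how the table above was produced.

```r
# Invented sample of addresses (not the real column).
addresses <- c(
  "163 WASHINGTON AVENUE", "185 PACIFIC STREET", "163 WASHINGTON AVENUE",
  "8647 15TH AVENUE", "163 WASHINGTON AVENUE", "185 PACIFIC STREET"
)

# Count occurrences of each address, most frequent first.
address.counts <- sort(table(addresses), decreasing = TRUE)

# Keep only addresses that occur more than once.
repeated <- address.counts[address.counts > 1]
print(head(repeated, 5))
```

Addresses that repeat many times are probably multi-unit buildings where every apartment was sold separately, so the apartment number would be needed to tell the rows apart.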
Factor with 39 levels that can represent the geographical location of a building.
Factor variable with 4 levels
##
## 1 2 3 4
## 12176 8448 2 2597
Most values are of class 1
Factor variable with 132 levels. Most frequent:
##
## D4 B1 C0 R4 B3 B2 A5 B9 A1 A9 R1 C3 C2 C6 C1
## 2230 2209 2173 1998 1256 1159 976 906 778 752 722 647 568 563 509
## S2 RG R3 R2 V0 RP V1 G7 A2 S1 A4 K1 K4 R6 S9
## 485 407 390 343 306 288 219 216 191 180 167 140 136 135 133
## C7 RS E9 S3 S4 Z9 D1 K2 F9 S5 G9 RB A3 R8 G2
## 130 122 114 104 102 97 95 84 79 74 70 66 53 51 48
## RW E1 R9 C4 G1 G0 O7 C5 D7 O9 M1 RK RT G6 R5
## 41 39 37 34 32 31 31 30 30 26 23 21 21 16 15
## E3 F4 F5 F1 M9 O1 O5 W9 O2 O8 RR D6 K5 V3 C9
## 14 14 14 13 12 12 12 10 9 9 9 8 8 8 7
## D9 G4 G5 I7 C8 D5 I4 K9 S0 V9 E7 I5 V2 Z0 F2
## 7 7 7 7 6 6 6 6 6 6 5 5 5 5 4
## I6 I9 N9 P9 W2 A7 E2 I1 K7 L9 V5 W8 D0 GU M3
## 4 4 4 4 4 3 3 3 3 3 3 3 2 2 2
## N2 P3 RA W3 A0 D3 D8 E4 G8 GW J9 K6 L1 L8 M2
## 2 2 2 2 1 1 1 1 1 1 1 1 1 1 1
## M4 O6 P5 P6 Q9 R0 R7 T9 U7 U8 W1 Y1
## 1 1 1 1 1 1 1 1 1 1 1 1
I should have sales for the whole of 2015.
Obviously most sales are made on weekdays, with a spike on June 30 and a decline towards the end of the year.
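The weekday pattern and the busiest date can also be checked numerically. The sketch below uses a few invented dates. `format(x, "%u")` gives the ISO weekday number (1 = Monday, 7 = Sunday), which avoids locale-dependent day names.

```r
# Invented sale dates (not the real column).
dates <- as.Date(c("2015-06-30", "2015-06-30", "2015-06-27", "2015-03-31"))

# ISO weekday number: 1 = Monday ... 7 = Sunday.
iso.weekday <- as.integer(format(dates, "%u"))
is.weekend <- iso.weekday >= 6
print(table(is.weekend))

# Date with the highest number of sales.
busiest <- names(which.max(table(dates)))
print(busiest)
```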
There are 23223 property sales with 23 variables in this dataset. 8157 rows don’t have information about price. For the others the mean price is $1237000 and the median is $660000. Most properties are between 1000 and 3500 square feet and have between 1500 and 3000 square feet of land. Most buildings were constructed between 1895 and 1935.
The main features for me are price, gross square feet and land square feet.
Other features that I find interesting are year built, number of units, sale date and zip code.
I created a variable “total square feet” for the sum of gross and land square feet and a variable for price per square foot.
I’ve deleted zeros and obvious mistakes from “year.built”. For plots I divided price by 1000 and made logarithmic transformations of the price and area variables because of their long-tailed distributions.
First I want to plot price vs gross square feet.
Filter out zero values and zoom in:
Here we can see some vertical bands at round numbers and a horizontal stripe around $1000000, and also a lot of variance in price for the same square footage. The smoothing layer shows that on average larger properties cost more. I want to calculate the correlation coefficient:
## cor
## 0.5014883
There is a positive correlation, but it is not very strong.
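A sketch of the correlation calculation on toy data. The output format above (a value labelled `cor`) suggests `cor.test()`, but that is a guess. Whether zero values were dropped before computing the coefficient is also my assumption.

```r
# Toy stand-in for the real data; values are invented for illustration.
sales <- data.frame(
  gross.square.feet = c(0, 1428, 1660, 2106, 2090, 2112),
  sale.price        = c(500000, 758000, 778000, 0, 1365000, 1470000)
)

# Zeros stand for missing values here, so drop them first (an assumption).
known <- subset(sales, sale.price > 0 & gross.square.feet > 0)

# Pearson correlation; $estimate is a single number named "cor".
r <- cor.test(known$gross.square.feet, known$sale.price)$estimate
print(r)
```

Leaving the zeros in would mix "missing" with "small" and could distort the coefficient in either direction, so it is worth stating which rows went into the calculation.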
I delete zero values and look closer:
Here we can also see strong vertical bands and large differences in price for the same square footage.
## cor
## 0.3504385
There is a low positive correlation.
Now I filter out zero values and add a linear regression line:
As expected, larger properties on average cost more, but there is also a lot of variance due to other variables.
Correlation coefficient for total square feet and price:
## cor
## 0.4821405
The correlation is lower than 0.5, suggesting that the variables are correlated, but not very strongly.
Now I’m interested in comparing price per square foot vs. total square feet; maybe there is a difference in price between small and large properties.
The plot shows a lot of variance in price but no obvious increase or decrease.
I can see more variance in prices for houses built in 1899-1931 and 2000-2015, but I’m not sure about changes in mean prices, so I want to cut years by decade and make a boxplot.
There are small differences in median price between decades. On average, properties built at the beginning of this century and of the last one cost more than those built in the middle of the last century.
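Cutting years into decades can be done with `cut()`. The sketch below uses invented rows. The bucket edges (1890 to 2020 in steps of 10) are my assumption, and `year.built.bucket` is the column name used by the plotting code later in the report.

```r
# Toy stand-in for the real data; values are invented for illustration.
sales <- data.frame(
  year.built        = c(1899, 1925, 1931, 1955, 2005, 2015, NA),
  price.square.foot = c(300, 250, 240, 150, 280, 320, 200)
)

# Left-closed decade buckets such as [1920,1930); dig.lab = 4 keeps the
# labels as plain four-digit years instead of scientific notation.
breaks <- seq(1890, 2020, by = 10)
sales$year.built.bucket <- cut(sales$year.built, breaks = breaks,
                               right = FALSE, dig.lab = 4)

# Median price per square foot in each decade (NA for empty decades).
median.by.decade <- tapply(sales$price.square.foot,
                           sales$year.built.bucket, median)
print(median.by.decade)
```

Rows with a missing year get an `NA` bucket and drop out of the per-decade medians.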
Obviously, a lot of the difference in price per square foot is explained by location. I want to make a boxplot after filtering out zero prices and zoom in on prices lower than $2500:
We can also see a lot of difference in sale prices per unit.
Next I want to look at distribution of floor area across neighborhoods:
For example, properties sold in the Downtown - Fulton mall area are on average larger than those in Windsor Terrace. But if I look at area per unit, the differences are not so large:
I want to look closer to see the difference:
Now I want to look at the distribution of prices for different tax classes:
Next - prices per square foot:
The distribution for class 2C looks rather different from the others.
The median price for class 2C is considerably higher than for the others.
Most small residential properties (class 1) were sold in Bedford-Stuyvesant, condos (class 2) - in Park Slope, commercial (class 4) - in Bedford-Stuyvesant and Williamsburg-North. Next I want to look at proportions of different tax classes.
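Proportions of tax classes can be computed from a contingency table with `prop.table()`. This sketch assumes the proportions are taken within each neighborhood, which the report does not state. The rows are invented and `tax.class` is a shortened, hypothetical column name.

```r
# Toy stand-in for the real data; rows are invented for illustration.
sales <- data.frame(
  neighborhood = c("PARK SLOPE", "PARK SLOPE", "PARK SLOPE",
                   "BATH BEACH", "BATH BEACH"),
  tax.class    = c("2", "2", "1", "1", "1")
)

# Counts of each tax class per neighborhood.
counts <- table(sales$neighborhood, sales$tax.class)

# Row-wise proportions: each neighborhood's row sums to 1.
shares <- prop.table(counts, margin = 1)
print(round(shares, 2))
```

Proportions make neighborhoods with very different numbers of sales comparable, which raw counts do not.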
There is some difference in median prices between building classes. For example, indoor public and cultural facilities cost more per square foot than educational facilities.
Among big sales there are stripes around specific dates like 2015.03.01 or 2015.06.30.
The red line shows the mean price for every day of the year. Now I want to look at how price per square foot changes with time:
There is a slight increase towards the end of the year.
There is a positive correlation between price and gross and land square feet; obviously larger properties cost more, but it is not strong enough to explain all the variance in price. Houses built at the beginning and at the end of the last century on average have higher prices per square foot than those built in the middle of the century. Price per square foot varies significantly between neighborhoods. For tax class 2C the price distribution looks different from the others and has a higher mean and variance.
There is some difference in median prices between building classes. For example, indoor public and cultural facilities cost more per square foot than educational facilities. The highest number of small residential properties (class 1) were sold in Bedford-Stuyvesant, condos (class 2) in Park Slope, and commercial properties (class 4) in Bedford-Stuyvesant and Williamsburg-North.
Sale price correlates positively with floor and land area; price is also strongly related to neighborhood.
Tax class 1 points are mostly situated in the lower left corner, class 2 in the lower middle, and class 3 are dispersed around. Now I’m interested in the floor area distribution of different tax classes:
Properties larger than 4000 feet are mostly commercial or condo, and most small residential properties have an area of less than 5000 feet.
Next I divided price and square footage by the number of units and used a log scale to see the relation between price and size of one unit across tax classes.
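The per-unit variables can be sketched like this. The data are invented. Using gross square feet and total units, and taking base-10 logarithms explicitly instead of a log axis, are my assumptions. `price.per.unit` and `area.per.unit` are hypothetical names.

```r
# Toy stand-in for the real data; values are invented for illustration.
sales <- data.frame(
  sale.price        = c(758000, 1365000, 4000000, 500000),
  gross.square.feet = c(1428, 2090, 8000, 0),
  total.units       = c(1, 2, 10, 0)
)

# Drop rows where price, area or unit count is zero (treated as missing),
# which also avoids dividing by zero and taking log of zero.
usable <- subset(sales, sale.price > 0 & gross.square.feet > 0 & total.units > 0)

# Price and floor area of one unit.
usable$price.per.unit <- usable$sale.price / usable$total.units
usable$area.per.unit  <- usable$gross.square.feet / usable$total.units

# Log-transformed versions for plotting on a log scale.
usable$log.price.per.unit <- log10(usable$price.per.unit)
usable$log.area.per.unit  <- log10(usable$area.per.unit)
print(usable)
```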
First I noticed a group of points with prices lower than $1000. This looks strange to me; maybe these are mistakes. It looks like units of class 2 are on average cheaper and smaller than class 1, and units of class 4 are larger and more expensive. I wonder whether price rises at the same rate as floor area, or whether one square foot costs less in large properties. I plotted price per square foot against floor area per unit on a log scale:
There is some evidence of a downward trend for tax class 2 and bigger commercial properties.
Now I’m interested in whether price and square footage depend on the year built.
It looks like newer buildings are slightly cheaper.
Here I can notice that among the properties built in the beginning of the last century commercial (class 4) properties have higher prices than small residential (class 1) and condos (class 2). On the other hand, commercial properties built in 21st century mostly cost less than residential and condos.
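library(ggplot2)
library(dplyr)  # provides filter()

# Price per square foot by decade built, filled by tax class, zoomed to $0-2000
ggplot(filter(sales, price.square.foot > 0),
       aes(year.built.bucket, price.square.foot)) +
  geom_boxplot(aes(fill = tax.class.at.time.of.sale)) +
  coord_cartesian(ylim = c(0, 2000)) +
  theme(axis.text.x = element_text(angle = 60, hjust = 1))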
If I look at price per square foot I see no obvious pattern.
Now I’m interested in the distribution of prices per unit across neighborhoods, colored by tax class.
Here we can see that the distribution varies significantly between neighborhoods. For example, I notice clusters of lower-priced class 4 points in Bedford Stuyvesant and Williamsburg, and more expensive class 2 properties in the same Williamsburg.
I noticed a group of points with prices lower than $1000. This looks strange to me; maybe these are mistakes. It looks like units of class 2 are on average cheaper and smaller than class 1, and units of class 4 are larger and more expensive. Properties larger than 4000 feet are mostly commercial or condo, and most small residential properties have an area of less than 5000 feet. Price distribution varies significantly between neighborhoods. For example, I notice clusters of lower-priced class 4 points in Bedford Stuyvesant and Williamsburg, and more expensive class 2 properties in the same Williamsburg.
Among the properties built in the beginning of the last century commercial (class 4) properties have higher prices than small residential (class 1) and condos (class 2). On the other hand, commercial properties built in 21st century mostly cost less than residential and condos.
On the log scale the distribution looks almost normal, with the exception of outliers lower than $1000. Most data are spread between $100 000 and $10 000 000 with the mode around $1 000 000. It is interesting that there are a lot more sales just below $1 mln than at $1 mln. I suppose this is a kind of psychological threshold, or it is connected with tax regulations.
On this plot I can notice that among the properties built in the beginning of the last century commercial (class 4) properties have higher prices than small residential (class 1) and condos (class 2). On the other hand, commercial properties built in 21st century mostly cost less than residential and condos.
This plot shows the relationship between unit sale price and floor area in different tax classes. Here we can see that properties of class 2 (condominiums and coops) mostly have an area of less than 1000 square feet and cost less than $1 mln. Commercial properties (class 4) are on average larger and cost more than $1 mln. The smoothing layer shows that in general price goes up as size increases, but there is a lot of variance due to other variables.
This dataset contains information about c. 23000 properties sold in Brooklyn, NY in 2015, described by 21 variables. Analysing individual variables, I found that the main features of interest (sale price, gross square feet, land square feet) have a significant proportion of missing values. Some categorical variables had more than 20 levels, which made visualisation more difficult. As expected, I found that the price and size variables have distributions with a very long right tail, which made me use log transformation.
After that I explored relationships between price and floor area, neighborhood, building year and tax class. Obviously price is positively correlated with floor area, but a lot of the variance depends on location. I’m interested in further exploring the reasons behind the relationships between price and year built and between price and tax class.
It would also be interesting to build a model for price prediction and to find methods for imputing the missing data. As real estate prices change with time, this will in my opinion be the main limitation on using a model built on this data.